Introduction to Distributional Regression

Lecture 3

Gillian Heller

NHMRC Clinical Trials Centre, University of Sydney

gillian.heller@sydney.edu.au

An example

Dutch boys BMI

  • The Dutch boys BMI dataset consists of the ages and BMIs of 7,294 Dutch boys between birth and 22 years.

  • Clearly a linear model will be inadequate.

Dutch boys BMI

  • The Dutch boys BMI dataset consists of the ages and BMIs of 7,294 Dutch boys between birth and 22 years.

  • Clearly a linear model will be inadequate.

  • To start, we analyse BMI for boys aged between 10 and 20, which is approx linear growth

Dutch boys BMI

A linear model

\begin{align*} y_i&=\text{ BMI of boy }i, \quad x_i=\text{ age of boy }i\\ y_i&=\beta_0+\beta_1 x_i+\epsilon_i\\ \epsilon_i&\sim\mathcal{N}(0,\sigma^2) \end{align*}

Code
dbbmi |> filter(age>10 & age < 20) -> dbbmi10

m1 <- lm(bmi ~ age, data=dbbmi10)
qqnorm(resid(m1))
qqline(resid(m1))

Fitting a distributional regression model

We need to

  1. Select the response distribution

  2. Select covariates for all of the model parameters of the chosen distribution

  3. Check model fit

  4. Go back to 1.

This is a complex process

1. Select the response distribution

Functions for fitting distributions

gamlss

gamlss() fits a distribution on the response using penalized maximim likelihood (RS or CG or mixed algorithms)

  m <- gamlss(y ~ x1 + x2,
          sigma.formula =~ x1 + x2,
          nu.formula =~ x2 + x3,
          tau.formula =~ x2 + x3,
          data = dataset, 
          family = NO, 
          n.cyc=20, ...)


  • histDist() plots a histogram of the response with the specified distribution

  • fitDist() fits a set of distributions to the response and chooses the one with the smallest GAIC (marginal fit)

  • chooseDist() fits a set of distributions on the specified model and chooses the one with the smallest GAIC (conditional fit)

Functions for fitting distributions

gamlss2

gamlss2() fits a distribution on the response using penalized maximim likelihood (RS or CG or mixed algorithms)

m <- gamlss2(y ~ x1 + x2 |.|x2 + x3|.,
        data = dataset, 
        family = NO, 
        maxit=20, ...)
  • find_family() fits a set of distributions on an intercept-only (i.e. marginal) model and chooses the one with the smallest GAIC

  • find_gamlss2(formula, families = available_families(type="continuous"), k = 2, select = FALSE, ...) selects a distribution conditional on formula according to GAIC

  • find_gamlss2(formula, families = available_families(type="continuous"), k = 2, select = TRUE, ...) selects a distribution and variables according to GAIC

Choosing distributions

  • Global deviance

\begin{split} \text{GDEV}\,=& -2\ell(\boldsymbol{\hat{\theta}}) \\ =& -2 \sum_{i=1}^n \log f(y_i |\boldsymbol{\hat{\theta}}) \end{split}

  • Two well-known criteria

\begin{split} {\rm AIC} &= \text{GDEV}+2 \, \text{df}\\ {\rm BIC} &= \text{GDEV}+ \log n \cdot \text{df} \end{split}

  • Generalized Akaike information criterion

\text{GAIC}(k) = \text{GDEV} + k \cdot \text{df}

Dutch boys BMI

The linear model in gamlss2

library(gamlss2)
m2 <- gamlss2(bmi ~ age, data = dbbmi10, family = NO)
GAMLSS-RS iteration  1: Global Deviance = 16958.7855 eps = 0.069449     
GAMLSS-RS iteration  2: Global Deviance = 16958.7855 eps = 0.000000     
summary(m2)
Call:
gamlss2(formula = bmi ~ age, data = dbbmi10, family = NO)
---
Family: NO 
Link function: mu = identity, sigma = log
*--------
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
mu.(Intercept)    10.87058    0.22405   48.52   <2e-16 ***
mu.age             0.57843    0.01487   38.90   <2e-16 ***
sigma.(Intercept)  0.90673    0.01171   77.43   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*--------
n = 3646 df =  3 res.df =  3643
Deviance = 16958.7855 Null Dev. Red. = 6.94%
AIC = 16964.7855 elapsed =  0.05sec

Dutch boys BMI

The linear model in gamlss2

par(mfrow=c(1,2))
plot(m2, which = "qq-resid")
plot(m2, which = "wp-resid")

Dutch boys BMI

  • Clearly the normal linear model provides a very bad fit

  • We need to find another distribution for BMI

  • The following code chooses the best-fitting continuous distribution using the AIC (marginal fit)

f <- find_family(dbbmi10$bmi, k=2, families = available_families(type = "continuous"),
                 verbose = FALSE)
.. .. IC = 17884.13 
.. .. IC = 17884.13 
.. .. error
.. .. IC = 17882.88 
.. .. IC = 17882.88 
.. .. error
.. .. IC = 17886.13 
.. .. IC = 17886.13 
.. .. error
.. .. error
.. .. error
.. .. IC = 17956.21 
.. .. IC = 17920.96 
.. .. IC = 28931.35 
.. .. IC = 18008.99 
.. .. IC = 90191.43 
.. .. error
.. .. IC = 17955.96 
.. .. IC = 17883.39 
.. .. IC = 17905.94 
.. .. IC = 29273.12 
.. .. IC = 18133.24 
.. .. IC = 19819.34 
.. .. IC = 17940.46 
.. .. IC = 55458.11 
.. .. IC = 17885.2 
.. .. IC = 18032.16 
.. .. error
.. .. IC = 18167.88 
.. .. error
.. .. IC = 17940.32 
.. .. IC = 17940.32 
.. .. IC = 19136.73 
.. .. error
.. .. IC = 18228.47 
.. .. IC = 18228.47 
.. .. IC = 18230.47 
.. .. IC = 36755.01 
.. .. IC = 37253.67 
.. .. error
.. .. IC = 29273.12 
.. .. IC = 31709.55 
.. .. IC = 18399.27 
.. .. IC = 18191.22 
.. .. IC = 17917.08 
.. .. error
.. .. IC = 18091.67 
.. .. IC = 17967.2 
.. .. IC = 18091.67 
.. .. IC = 17912.58 
.. .. IC = 17883.73 
.. .. IC = 17922.92 
.. .. IC = 17933.34 
.. .. IC = 17914.67 
.. .. error
.. .. IC = 18230.48 
.. .. IC = 17915.17 
.. .. IC = 17905.84 
.. .. IC = 17974.12 
.. .. IC = 17990.73 
.. .. IC = 17906.96 
.. .. IC = 17906.96 
.. .. IC = 17995.27 
.. .. IC = 17964.2 
.. .. IC = 18154.08 
.. .. IC = 18154.13 
.. .. IC = 20507.78 
.. .. IC = 19694.15 
.. .. IC = 20821.08 
f
     GAF   IGAMMA  PARETO1   PARETO PARETO2o       GP  PARETO2      EXP 
90191.43 55458.11 37253.67 36755.01 31709.55 29273.12 29273.12 28931.35 
    WEI3      WEI       GU     WEI2     LQNO       PE      SN1      NOF 
20821.08 20507.78 19819.34 19694.15 19136.73 18399.27 18230.48 18230.47 
     NO2       NO      PE2       LO      TF2       TF       GT     SEP2 
18228.47 18228.47 18191.22 18167.88 18154.13 18154.08 18133.24 18091.67 
     SEP     JSUo       GA      ST4      ST2      ST1     SEP1      ST5 
18091.67 18032.16 18008.99 17995.27 17990.73 17974.12 17967.20 17964.20 
    EGB2      GB2       IG    LOGNO   LOGNO2   SHASHo    SHASH   exGAUS 
17956.21 17955.96 17940.46 17940.32 17940.32 17933.34 17922.92 17920.96 
      RG      SN2  SHASHo2     SEP3      ST3     ST3C      GIG      SST 
17917.08 17915.17 17914.67 17912.58 17906.96 17906.96 17905.94 17905.84 
    BCTo      BCT      JSU     BCCG    BCCGo     SEP4       GG     BCPE 
17886.13 17886.13 17885.20 17884.13 17884.13 17883.73 17883.39 17882.88 
   BCPEo 
17882.88 

Dutch boys BMI

The chosen distribution is BCPEo (Box-Cox Power Exponential):

histDist(dbbmi10$bmi, family = "BCPEo")


Family:  c("BCPEo", "Box-Cox Power Exponential-orig.") 
Fitting method: "nlminb" 

Call:  gamlssML(formula = dbbmi10$bmi, family = "BCPEo") 

Mu Coefficients:
[1]  2.948
Sigma Coefficients:
[1]  -1.929
Nu Coefficients:
[1]  -0.7292
Tau Coefficients:
[1]  0.7624

 Degrees of Freedom for the fit: 4 Residual Deg. of Freedom   3642 
Global Deviance:     17874.9 
            AIC:     17882.9 
            SBC:     17907.7 

Dutch boys BMI

  • It makes more sense to choose the best-fitting conditional distribution.

  • We specify age as the covariate for all distribution parameters

  • I am using the gamlss functions here because gamlss2 gave errors.

m0 <- gamlss(bmi ~ age,
               sigma.formula =~ age,
               nu.formula =~ age,
               tau.formula =~ age,
               family = NO,
               data=dbbmi10,
               n.cyc = 100)
GAMLSS-RS iteration 1: Global Deviance = 16945.38 
GAMLSS-RS iteration 2: Global Deviance = 16945.38 
GAMLSS-RS iteration 3: Global Deviance = 16945.38 
f <- chooseDist(m0, type="realAll")
minimum GAIC(k= 2 ) family: exGAUS 
minimum GAIC(k= 3.84 ) family: exGAUS 
minimum GAIC(k= 8.2 ) family: exGAUS 
f |>
  data.frame() |>
  dplyr::arrange(X2)
               X2    X3.84     X8.2
exGAUS   16288.32 16299.36 16325.52
EGB2     16294.60 16309.32 16344.20
SEP4     16295.21 16309.93 16344.81
JSU      16295.59 16310.31 16345.19
SHASH    16298.27 16312.99 16347.87
GB2      16299.84 16314.56 16349.44
BCPE     16301.72 16316.44 16351.32
ST1      16302.12 16316.84 16351.72
JSUo     16302.78 16317.50 16352.38
BCT      16302.97 16317.69 16352.57
ST5      16303.81 16318.53 16353.41
ST2      16304.41 16319.13 16354.01
BCCG     16304.43 16315.47 16341.63
BCPEo    16304.79 16319.51 16354.39
BCTo     16306.36 16321.08 16355.96
BCCGo    16307.62 16318.66 16344.82
GG       16311.38 16322.42 16348.58
RG       16318.55 16325.91 16343.35
SHASHo2  16318.85 16333.57 16368.45
SHASHo   16318.86 16333.58 16368.46
SST      16323.33 16338.05 16372.93
ST3      16324.00 16338.72 16373.60
SEP1     16328.94 16343.66 16378.54
SEP3     16330.91 16345.63 16380.51
SEP2     16331.19 16345.91 16380.79
SN1      16377.49 16388.53 16414.69
ST4      16387.83 16402.55 16437.43
SN2      16426.90 16437.94 16464.10
IGAMMA   16464.82 16472.18 16489.62
LOGNO2   16553.75 16561.11 16578.55
LNO      16553.75 16561.11 16578.55
LOGNO    16553.75 16561.11 16578.55
IG       16557.30 16564.66 16582.10
GT       16652.42 16667.14 16702.02
TF2      16658.73 16669.77 16695.93
TF       16659.35 16670.39 16696.55
GA       16666.47 16673.83 16691.27
LO       16695.28 16702.64 16720.08
NET      16703.89 16711.25 16728.69
PE       16723.42 16734.46 16760.62
PE2      16723.84 16734.88 16761.04
NO       16953.38 16960.74 16978.18
WEI      17978.46 17985.82 18003.26
WEI3     17978.51 17985.87 18003.31
WEI2     18177.15 18184.51 18201.95
GU       18980.03 18987.39 19004.83
EXP      28908.71 28912.39 28921.11
PARETO2o 28963.20 28970.56 28988.00
PARETO2  28984.92 28992.28 29009.72
GP       28984.92 28992.28 29009.72
GIG            NA       NA       NA

Dutch boys BMI

m.exGAUS <- gamlss2(bmi ~ age | age | age,
              data=dbbmi10, family=exGAUS)
GAMLSS-RS iteration  1: Global Deviance = 16748.3316 eps = 0.249466     
GAMLSS-RS iteration  2: Global Deviance = 16595.5891 eps = 0.009119     
GAMLSS-RS iteration  3: Global Deviance = 16468.0303 eps = 0.007686     
GAMLSS-RS iteration  4: Global Deviance = 16380.1908 eps = 0.005333     
GAMLSS-RS iteration  5: Global Deviance = 16327.6312 eps = 0.003208     
GAMLSS-RS iteration  6: Global Deviance = 16300.2541 eps = 0.001676     
GAMLSS-RS iteration  7: Global Deviance = 16287.0502 eps = 0.000810     
GAMLSS-RS iteration  8: Global Deviance = 16281.0575 eps = 0.000367     
GAMLSS-RS iteration  9: Global Deviance = 16278.385 eps = 0.000164     
GAMLSS-RS iteration 10: Global Deviance = 16277.2164 eps = 0.000071     
GAMLSS-RS iteration 11: Global Deviance = 16276.7055 eps = 0.000031     
GAMLSS-RS iteration 12: Global Deviance = 16276.4822 eps = 0.000013     
GAMLSS-RS iteration 13: Global Deviance = 16276.3889 eps = 0.000005     
summary(m.exGAUS)
Call:
gamlss2(formula = bmi ~ age | age | age, data = dbbmi10, family = exGAUS)
---
Family: exGAUS 
Link function: mu = identity, sigma = log, nu = log
*--------
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
mu.(Intercept)     8.731104   0.276086  31.625  < 2e-16 ***
mu.age             0.581581   0.019263  30.192  < 2e-16 ***
sigma.(Intercept) -0.583069   0.160032  -3.643 0.000273 ***
sigma.age          0.056686   0.010409   5.446 5.50e-08 ***
nu.(Intercept)     0.760758   0.152497   4.989 6.36e-07 ***
nu.age            -0.001436   0.010458  -0.137 0.890780    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*--------
n = 3646 df =  6 res.df =  3640
Deviance = 16276.3889 Null Dev. Red. = 9.15%
AIC = 16288.3889 elapsed =  1.08sec

Dutch boys BMI

plot(m.exGAUS, which = "wp-resid")

2. Select covariates for all of the model parameters of the chosen distribution

Model selection

Recall the GAMLSS model

\begin{align*} y_{i}&\sim \textcolor{red}{\mathcal{D}\left(\theta_{i1},\ldots,\theta_{iK}\right)} \\ \textcolor{green}{g_k(\theta_{ik})}&=\beta_0^{k}+\textcolor{blue}{s_1^{k}(x_{i1})+\ldots+s_p^{k}( x_{ip})}\qquad k=1,\ldots,K \end{align*}

  • Assume we have chosen \mathcal{D}\left(\theta_{i1},\ldots,\theta_{iK}\right)

  • We now need to choose the terms s_1^{k}(x_{1}),\ldots,s_p^{k}( x_{p}) for each distribution parameter \theta_k.

  • We firstly define the options for the terms s^k_j(x_j)

  • Then we discuss strategies for model selection

Model terms s^k_j(x_j)

  • parametric e.g. x_j, \log(x_j), x_j^2, etc

  • smooth term (splines)

  • random effect

  • spatial effect

All of these need to be expressed as linear functions

Splines

  • Dropping the indices j and k, each function is expanded in a basis as

s({x}_i)=\sum_{\ell=1}^L \gamma_\ell \boldsymbol{B}_\ell ({x}_i)

where

  • B_\ell ({x}_i) represent different types of basis functions

  • \gamma_\ell are the basis amplitudes

  • In matrix notation, each of the predictors can then be written for all observations as

\boldsymbol{\eta}=\beta_0 \mathbf{1}_n+\boldsymbol{B}_1 \gamma_1+\ldots+\boldsymbol{B}_{\jmath} \gamma_{\jmath} .

Splines

  • Approximate functions s(x) based on a linear combination of B-spline functions:

s(x)=\sum_{\ell=1}^L \gamma_\ell B_\ell(x)

Splines

  • Estimated B-splines for a varying number of knots:

Splines

  • Estimates crucially depend on the number of basis functions.

\Rightarrow Add a regularization term that penalizes function estimates with high variability.

  • Effectively leads to a trade-off between fidelity to the data and smoothness.
  • Popular approach (e.g. for smoothing splines): Use a penalty based on the integrated squared second derivative:

\operatorname{pen}(s)=\lambda \int\left(s^{\prime \prime}(x)\right)^2 d x

  • The smoothing parameter \lambda determines the impact of the penalty and should be estimated jointly with the regression coefficients.

Coding of terms

Term gamlss gamlss2
parametric x1 + x2^2 + log(x3) x1 + x2^2 + log(x3)
nonlinear (spline) pb(x1) s(x1)
cyclic spline pbc(x1)
random intercept re(random =~ 1|id) s(id, bs="re")
random slope re(random =~ x1|id) s(id, x1, bs="re")

Example: Dutch boys’ BMI, again

  • Fit a BCTo model with only a smooth term for age in \mu (just to illustrate P-splines)

  • Compute fitted values \hat\mu (\mu is the median of the BCTo distribution)

  • Vary the penalty in the spline (df) to see the effect

m1 <- gamlss(bmi ~ pb(age), data=dbbmi, family=BCTo, trace = FALSE) # default penalty
m2 <- gamlss(bmi ~ pb(age, df=3), data=dbbmi, family=BCTo, trace = FALSE)
m3 <- gamlss(bmi ~ pb(age, df=15), data=dbbmi, family=BCTo, trace = FALSE)

# compute fitted values for mu
p1 <- predict(m1, what="mu", type = "response")
p2 <- predict(m2, what="mu", type = "response")
p3 <- predict(m3, what="mu", type = "response")

dbbmi <- cbind(dbbmi, p1, p2, p3)

Example: Dutch boys’ BMI, again

ggplot(dbbmi, aes(x=age, y=bmi)) +
  geom_point(shape = 21, fill = "lightgray", color = "black", size = 0.05) +
  geom_smooth(aes(x=age, y=p1), size=0.5, colour = "blue") +
  geom_smooth(aes(x=age, y=p2), colour = "red", size=0.5) +
  geom_smooth(aes(x=age, y=p3), colour = "green", size=0.5) +
  theme_bw()